1. Introduction

2. Methodology

The goal of the statistical modelling in this study is to find answers to three key questions:

  1. How accurately can the rankings be predicted from the independent variables?
  2. What are the most important features for the predictions?
  3. What is the direction of each feature's impact?

The method chosen for the study was gradient boosted decision trees (GBDT), a widely used machine learning technique that applies to many settings, from regression and classification to learning-to-rank problems. In a learning-to-rank problem there is an ordered list of items, and the goal of the model is to calculate a score for each item from the independent variables such that sorting by score reproduces the original order.

To build the model, the data set was split into two parts: training data (around 70% of searches) and test data (the rest, about 30%). The GBDT model was fitted on the training data, predictions were calculated for the test data, and the predictions were then compared with the observed rankings. The chosen evaluation metric was Spearman’s rank correlation coefficient, a scaled measure of the agreement between two rankings. Perfectly matching rankings give a value of 1, the expected value for random rankings is 0, and exactly reversed rankings give -1.
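
As an illustration of the evaluation step, the following minimal Python sketch assigns whole searches to a training or test fold and computes Spearman's coefficient from scratch. The function names (split_by_search, average_ranks, spearman), the fixed seed, and the toy rankings are assumptions made for this example and are not taken from the study's code; in practice a library routine such as scipy.stats.spearmanr would likely be used instead.

```python
import random


def split_by_search(search_ids, train_share=0.7, seed=0):
    """Assign whole searches to train or test so no search spans both folds."""
    unique = sorted(set(search_ids))
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    rng.shuffle(unique)
    cut = round(train_share * len(unique))
    return set(unique[:cut]), set(unique[cut:])


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman's coefficient: the Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Toy example: observed positions of five results in one search.
observed = [1, 2, 3, 4, 5]
print(spearman(observed, [1, 2, 3, 4, 5]))  # predicted order identical
print(spearman(observed, [5, 4, 3, 2, 1]))  # predicted order reversed
```

Splitting by search rather than by individual result keeps all results of one search in the same fold, which matches the description of the training data as containing around 70% of searches.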

The next step is to understand why the model makes particular predictions: which independent variables are the most important, and how do their values affect the predictions? For this purpose SHapley Additive exPlanations (SHAP) values were calculated. SHAP expresses each prediction as a baseline value plus the sum of the contributions of the individual independent variables. The overall impact of any particular variable can then be measured as the average of the absolute values of its contributions over the whole data set.
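
To make the additive decomposition concrete, here is a small Python sketch that computes exact Shapley values by brute force for a toy scoring function and then averages their absolute values per feature. The scoring function, the three unnamed features, and the four data rows are hypothetical and chosen only for illustration. The study's GBDT model would normally be explained with a tree-specific SHAP algorithm, not with this exhaustive enumeration, which is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial


def toy_model(row):
    """Hypothetical scoring function standing in for a fitted model."""
    x0, x1, x2 = row
    return 2.0 * x0 - 1.0 * x1 + 0.5 * x0 * x2


def shapley_values(model, x, background):
    """Return (baseline, contributions) for one row x, by full enumeration."""
    n = len(x)

    def value(subset):
        # Mean prediction when features in `subset` take x's values and
        # the remaining features are taken from each background row.
        total = 0.0
        for row in background:
            mixed = [x[j] if j in subset else row[j] for j in range(n)]
            total += model(mixed)
        return total / len(background)

    phi = []
    for j in range(n):
        others = [k for k in range(n) if k != j]
        contribution = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (value(s | {j}) - value(s))
        phi.append(contribution)
    return value(set()), phi


# Hypothetical data set; it also serves as the background sample.
data = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 3.0, 0.0], [1.0, 1.0, 1.0]]
explanations = [shapley_values(toy_model, row, data) for row in data]

# Additivity: baseline plus contributions reproduces each prediction.
for row, (base, phi) in zip(data, explanations):
    assert abs(base + sum(phi) - toy_model(row)) < 1e-9

# Global importance: mean absolute contribution per feature.
mean_abs = [
    sum(abs(phi[j]) for _, phi in explanations) / len(explanations)
    for j in range(3)
]
print(mean_abs)
```

The sign of an individual contribution shows whether a feature value pushed that prediction up or down, while the mean absolute value ranks features by overall influence regardless of direction.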

Model results

A closer look at the features

The independent variables used in this study can be roughly organized into five main groups. These are listed below, along with a few important variables suggested by the SHAP values.

In terms of SEO, the first two categories are of little interest, as they are difficult or even impossible to change or adjust. The last three are more interesting and worth further investigation.

Type category

Basic information about type categories

Type                                     Value
Total unique categories                  72
Missing type category                    1.99%
Categories with >=10 results             26
Categories with >=100 results            13
Categories with >=1000 results           3
Median unique categories in one search   4
Min unique categories in one search      1
Max unique categories in one search      12

Key takeaways:

Title and description

Basic information about titles and descriptions

Type                                         Title    Description
Median character length (non-missing)        24       534
Min character length (non-missing)           4        8
Max character length (non-missing)           125      752
Missing                                      0.01%    40.7%
Containing lawyer or attorney                22.65%   43.57%
Containing car accident or personal injury   5.31%    44.7%
Containing city name                         5%       27.07%

Key takeaways:

Reviews

Basic information about reviews

Type                                 Value
Median #reviews                      14
Max #reviews                         968
No reviews available                 16.59%
Average rating                       4.61
Response ratio by owners             33.43%
Average number of likes per review   0.66

Key takeaways:

Provided updates and number of photos

Basic information about #photos and Google updates

Type                      Value
Median #photos            5
Max #photos               540
Zero #photos              5.78%
Provides Google updates   54.79%